Co-Fusion: Real-time Segmentation, Tracking and Fusion of Multiple Objects
In this paper we introduce Co-Fusion, a dense SLAM system that takes a live stream of RGB-D images as input and segments the scene into different objects (using either motion or semantic cues) while simultaneously tracking and reconstructing their 3D shape in real time. We use a multiple-model-fitting approach in which each object can move independently of the background and still be tracked effectively, its shape fused over time using only the information from pixels associated with that object's label. Previous attempts to deal with dynamic scenes have typically treated moving regions as outliers, and consequently neither modeled their shape nor tracked their motion over time. In contrast, our system maintains a 3D model for each segmented object and improves it over time through fusion. As a result, a robot can maintain a scene description at the object level, which has the potential to allow interaction with its working environment, even in the case of dynamic scenes.
Comment: International Conference on Robotics and Automation (ICRA) 2017, http://visual.cs.ucl.ac.uk/pubs/cofusion, https://github.com/martinruenz/co-fusion
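To make the multiple-model idea concrete, here is a minimal sketch of the per-object track-then-fuse loop in Python. All names here (ObjectModel, segment_frame, the sparse-TSDF layout) are illustrative assumptions, not Co-Fusion's actual API; the real system performs dense tracking and fusion per label.

    # Sketch of a multiple-model loop: every segmented object keeps its own
    # pose and fused volume, updated only from pixels carrying its label.
    from dataclasses import dataclass, field

    import numpy as np


    @dataclass
    class ObjectModel:
        label: int
        pose: np.ndarray = field(default_factory=lambda: np.eye(4))  # world-from-object
        volume: dict = field(default_factory=dict)  # sparse TSDF: voxel -> (sdf, weight)

        def track(self, rgb, depth, mask):
            # Placeholder for aligning this object's pixels against its model
            # (e.g. ICP / photometric alignment); returns the updated pose.
            return self.pose

        def fuse(self, depth, mask, intrinsics):
            # Placeholder for TSDF integration restricted to `mask`.
            pass


    def segment_frame(rgb, depth):
        # Placeholder for motion- or semantics-based segmentation:
        # returns an integer label image (0 = background).
        return np.zeros(depth.shape, dtype=np.int32)


    def process_frame(models: dict, rgb, depth, intrinsics):
        labels = segment_frame(rgb, depth)
        for label in np.unique(labels):
            mask = labels == label
            model = models.setdefault(int(label), ObjectModel(int(label)))
            model.pose = model.track(rgb, depth, mask)   # track independently...
            model.fuse(depth, mask, intrinsics)          # ...then fuse only its pixels
        return models

The design point mirrors the abstract: tracking and fusion for an object consume only the pixels carrying that object's label, so independently moving objects never corrupt the background model.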
Learning Neural Parametric Head Models
We propose a novel 3D morphable model for complete human heads based on hybrid neural fields. At the core of our model lies a neural parametric representation that disentangles identity and expressions in disjoint latent spaces. To this end, we capture a person's identity in a canonical space as a signed distance field (SDF), and model facial expressions with a neural deformation field. In addition, our representation achieves high-fidelity local detail by introducing an ensemble of local fields centered around facial anchor points. To facilitate generalization, we train our model on a newly captured dataset of over 3700 head scans from 203 different identities using a custom high-end 3D scanning setup. Our dataset significantly exceeds comparable existing datasets with respect to both quality and completeness of geometry, averaging around 3.5M mesh faces per scan. We will publicly release our dataset along with a public benchmark for both neural head avatar construction and an evaluation on a hidden test set for inference-time fitting. Finally, we demonstrate that our approach outperforms state-of-the-art methods in terms of fitting error and reconstruction quality.
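As an illustration of the disentangled parameterization, the sketch below (PyTorch, with assumed layer sizes and names) conditions an SDF on an identity code in canonical space and routes expressions through a separate deformation field, mirroring the description above; the paper's local-field ensemble around facial anchor points is omitted for brevity.

    import torch
    import torch.nn as nn


    class NeuralParametricHead(nn.Module):
        # Illustrative only: identity and expression live in disjoint latent spaces.
        def __init__(self, id_dim=64, expr_dim=64, hidden=128):
            super().__init__()
            # Canonical geometry: (point, identity code) -> signed distance.
            self.sdf = nn.Sequential(
                nn.Linear(3 + id_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            # Expression: (point, expression code) -> offset into canonical space.
            self.deform = nn.Sequential(
                nn.Linear(3 + expr_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 3),
            )

        def forward(self, x, z_id, z_expr):
            # Warp the query point into the canonical space, then query the SDF.
            x_canonical = x + self.deform(torch.cat([x, z_expr], dim=-1))
            return self.sdf(torch.cat([x_canonical, z_id], dim=-1))

Because identity and expression codes enter through separate networks, fitting a new expression only updates z_expr while the canonical identity geometry stays fixed.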
FroDO: From Detections to 3D Objects
Object-oriented maps are important for scene understanding since they jointly capture geometry and semantics and allow instantiation of, and meaningful reasoning about, individual objects. We introduce FroDO, a method for accurate 3D reconstruction of object instances from RGB video that infers object location, pose and shape in a coarse-to-fine manner. Key to FroDO is embedding object shapes in a novel learnt space that allows seamless switching between sparse point-cloud and dense DeepSDF decoding. Given an input sequence of localized RGB frames, FroDO first aggregates 2D detections to instantiate a category-aware 3D bounding box per object. A shape code is regressed using an encoder network before shape and pose are optimized further under the learnt shape priors using sparse and dense shape representations. The optimization uses multi-view geometric, photometric and silhouette losses. We evaluate on real-world datasets, including Pix3D, Redwood-OS, and ScanNet, for single-view, multi-view, and multi-object reconstruction.
Comment: To be published in CVPR 2020. The first two authors contributed equally.
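The refinement stage described above, optimizing shape code and pose under learnt priors with multi-view losses, can be sketched as a generic gradient loop. Everything here (function names, the loss interface, the optimizer choice) is an assumption for illustration, not FroDO's published implementation:

    import torch


    def refine_shape_and_pose(code_init, pose_init, decode, losses, steps=200, lr=1e-2):
        # `decode` maps a latent shape code to geometry; `losses` is a list of
        # callables (e.g. geometric, photometric, silhouette terms), each
        # returning a scalar. This wiring is illustrative, not FroDO's API.
        code = code_init.detach().clone().requires_grad_(True)
        pose = pose_init.detach().clone().requires_grad_(True)
        opt = torch.optim.Adam([code, pose], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            shape = decode(code)
            loss = sum(loss_fn(shape, pose) for loss_fn in losses)
            loss.backward()
            opt.step()
        return code.detach(), pose.detach()

The coarse-to-fine behavior would come from the caller: run the loop first with the cheap sparse point-cloud decoder, then again with the dense DeepSDF decoder, matching the "seamless switching" the abstract describes.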
HumanRF: High-Fidelity Neural Radiance Fields for Humans in Motion
Representing human performance at high fidelity is an essential building
block in diverse applications, such as film production, computer games or
videoconferencing. To close the gap to production-level quality, we introduce
HumanRF, a 4D dynamic neural scene representation that captures full-body
appearance in motion from multi-view video input, and enables playback from
novel, unseen viewpoints. Our novel representation acts as a dynamic video
encoding that captures fine details at high compression rates by factorizing
space-time into a temporal matrix-vector decomposition. This allows us to
obtain temporally coherent reconstructions of human actors for long sequences,
while representing high-resolution details even in the context of challenging
motion. While most research focuses on synthesizing at resolutions of 4MP or
lower, we address the challenge of operating at 12MP. To this end, we introduce
ActorsHQ, a novel multi-view dataset that provides 12MP footage from 160
cameras for 16 sequences with high-fidelity, per-frame mesh reconstructions. We
demonstrate challenges that emerge from using such high-resolution data and
show that our newly introduced HumanRF effectively leverages this data, making
a significant step towards production-level quality novel view synthesis.
Comment: Project webpage: https://synthesiaresearch.github.io/humanrf Dataset webpage: https://www.actors-hq.com/ Video: https://www.youtube.com/watch?v=OTnhiLLE7io Code: https://github.com/synthesiaresearch/humanrf
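The temporal matrix-vector decomposition at the heart of the method can be illustrated with a generic low-rank space-time grid: a feature at (x, y, z, t) is reconstructed as a sum over rank components of a spatial factor times a temporal factor. The sketch below (PyTorch, with assumed resolutions and names) is a simplified stand-in, not HumanRF's exact parameterization.

    import torch
    import torch.nn as nn


    class SpaceTimeFeatures(nn.Module):
        # Illustrative low-rank space-time grid: feature(x, y, z, t) is a sum
        # over R rank components of spatial_r(x, y, z) * temporal_r(t).
        def __init__(self, res=32, frames=100, rank=16, channels=8):
            super().__init__()
            self.spatial = nn.Parameter(torch.randn(rank, channels, res, res, res) * 0.1)
            self.temporal = nn.Parameter(torch.randn(rank, frames) * 0.1)

        def forward(self, xyz_index, t_index):
            # xyz_index: (N, 3) integer voxel indices; t_index: (N,) frame indices.
            s = self.spatial[:, :, xyz_index[:, 0], xyz_index[:, 1], xyz_index[:, 2]]  # (R, C, N)
            w = self.temporal[:, t_index]                                              # (R, N)
            return (s * w.unsqueeze(1)).sum(dim=0).transpose(0, 1)                     # (N, C)


    # Usage: query 4096 random space-time samples.
    grid = SpaceTimeFeatures()
    feats = grid(torch.randint(0, 32, (4096, 3)), torch.randint(0, 100, (4096,)))

Storing a handful of shared spatial grids plus per-frame temporal weights, rather than one grid per frame, is what yields the high compression rate and temporally coherent long-sequence reconstructions noted above.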